Content
- Data
- Hierarchical clustering
- Dimensionality reduction: PCA
- Dimensionality reduction: t-SNE
Laurent Gatto
Computational Biology, de Duve Institute, UCLouvain
https://lgatto.github.io
laurent.gatto@uclouvain.be
@lgatt0
Slides: http://bit.ly/highdimvis
Source: https://github.com/lgatto/visualisation
| | Sample 1 | Sample 2 | Sample 3 | … | group |
|---|---|---|---|---|---|
| Protein 1 | 8.12 | 7.54 | 14.54 | … | A |
| Protein 2 | 10.55 | 11.46 | 11.17 | … | |
| Protein 3 | 7.49 | 12.21 | 8.14 | … | B |
| Protein 4 | 14.79 | 11.73 | 3.36 | … | A |
| … | 10.99 | 9.08 | 13.37 | … | |
Sample-level visualisations use data from Mulvey et al. (2015), "Dynamic Proteomic Profiling of Extra-Embryonic Endoderm Differentiation in Mouse Embryonic Stem Cells".
Protein-level visualisations use data from Christoforou et al. (2016), "A draft map of the mouse pluripotent stem cell spatial proteome".
Hierarchical clustering methods start by calculating all pairwise distances between features (or samples) and then group them into clusters based on these distances. Various distance measures and clustering algorithms can be used.
Plots prepared with dist and hclust from the stats package and mrkHClust from pRoloc.
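The two steps described above (pairwise distances, then grouping) can be sketched with `dist` and `hclust` from the stats package. This is a minimal illustration on made-up toy data, not the code behind the figures; the Euclidean distance, the average linkage and the choice of two groups are assumptions made for the example.

```r
## Toy data (made up for illustration): 6 proteins quantified in 4 samples
set.seed(1)
x <- rbind(matrix(rnorm(12, mean = 8), nrow = 3),
           matrix(rnorm(12, mean = 14), nrow = 3))
rownames(x) <- paste0("Protein", 1:6)
colnames(x) <- paste0("Sample", 1:4)

## Step 1: all pairwise distances between rows (here, proteins)
d <- dist(x, method = "euclidean")

## Step 2: agglomerative clustering based on these distances
hc <- hclust(d, method = "average")
plot(hc)  # draw the dendrogram

## Cut the tree into two groups (assumed number of groups)
cl <- cutree(hc, k = 2)

## To cluster samples instead of proteins, transpose the matrix first
hc_samples <- hclust(dist(t(x)), method = "average")
```

Because `dist` computes distances between rows, transposing the matrix is what switches between protein-level and sample-level clustering. Other values of the `method` argument in `dist` and `hclust` can produce different trees.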
When the data span many dimensions (more than 2 or 3, up to thousands), it becomes impossible to visualise them easily in their entirety. Dimensionality reduction techniques such as PCA or t-SNE transform the data into a new space that summarises properties of the whole data set along a reduced number of dimensions. These informative dimensions are then used to visualise the data or to perform calculations more efficiently.
Principal Component Analysis (PCA) is a technique that transforms the original n-dimensional data into a new data space. Along these new dimensions, called principal components (PCs), the data expresses most of its variability along the first PC, then the second, and so on. These new dimensions are linear combinations of the original variables.
Figures produced with the plot2D function from the pRoloc package.
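The properties of PCA described above can be checked on a small example. This is a minimal sketch using `prcomp` from the stats package on made-up toy data; it is not how `plot2D` is implemented, and centring and scaling the variables is a choice made for this example.

```r
## Toy data (made up for illustration): 20 observations, 5 variables
set.seed(1)
x <- matrix(rnorm(100), nrow = 20, ncol = 5)
x[, 2] <- x[, 1] + rnorm(20, sd = 0.1)  # make two variables correlated

## Centre and scale the variables, then compute the principal components
pca <- prcomp(x, center = TRUE, scale. = TRUE)

## Proportion of variance captured by each PC, in decreasing order
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

## Each PC is a linear combination of the original variables;
## the weights are the columns of the rotation (loadings) matrix
pca$rotation

## Coordinates of the observations in the new space: plot PC1 against PC2
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")
```

Multiplying the centred and scaled data by `pca$rotation` reproduces `pca$x`, which is what "linear combination" means in practice. Rows are treated as observations, so transposing the input switches between a protein-level and a sample-level view.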
t-Distributed Stochastic Neighbour Embedding (t-SNE) is a non-linear dimensionality reduction technique, i.e. different regions of the data space are subjected to different transformations. t-SNE compresses small distances, thus bringing close neighbours together, and ignores large distances.
Figures produced with the plot2D function from the pRoloc package.
t-SNE (as well as many other methods, in particular classification algorithms) has two important parameters that can substantially influence the clustering of the data.
It is important to adapt these to each data set.
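The effect of parameter choice can be explored by running t-SNE several times with different settings. The text above does not name the two parameters; the sketch below assumes they are the perplexity and the number of iterations, the settings most commonly tuned in t-SNE. It also assumes the Rtsne package, which the source does not mention, and uses made-up toy data.

```r
## Assumes the Rtsne package is installed: install.packages("Rtsne")
library(Rtsne)

## Toy data (made up for illustration): two groups of 50 points, 10 dimensions
set.seed(1)
x <- rbind(matrix(rnorm(500, mean = 0), ncol = 10),
           matrix(rnorm(500, mean = 5), ncol = 10))
grp <- rep(c("A", "B"), each = 50)

## Same data and number of iterations, two different perplexities
ts5  <- Rtsne(x, dims = 2, perplexity = 5,  max_iter = 1000)
ts30 <- Rtsne(x, dims = 2, perplexity = 30, max_iter = 1000)

## The embedding is stored in the Y element, one row per input row
par(mfrow = c(1, 2))
plot(ts5$Y,  col = as.integer(factor(grp)), main = "perplexity = 5")
plot(ts30$Y, col = as.integer(factor(grp)), main = "perplexity = 30")
```

Comparing the two panels side by side gives a feel for how sensitive the layout is to the perplexity. Varying `max_iter` in the same way shows whether the embedding has stabilised.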
This material is made available under the Creative Commons Attribution license.
You are free to share (copy and redistribute the material in any medium or format) and adapt (remix, transform, and build upon the material) for any purpose, even commercially. Attribution: you must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.